Abstract:
Many key-value stores use RDMA to optimize messaging and data transmission between the application layer and the storage layer, but most of them provide only point operations. A skiplist-based store can support both point operations and range queries, but its CPU-intensive access operations, combined with a high-speed network, easily make the storage layer's CPU the bottleneck. The common solution is to offload some operations to the application layer and use RDMA to access remote data directly, bypassing the storage-layer CPU, but this method has so far been used only in hash-table-based stores. In this paper, we present RS-store, a skiplist-based key-value store with RDMA, which overcomes the CPU bottleneck of the storage layer by enabling two access modes: local access and remote access. In RS-store, we design a novel data structure, R-skiplist, to reduce the communication cost of remote access, and implement a latch-free concurrency control mechanism to handle concurrent operations across the two access modes. RS-store also supports client-active range queries, which reduce the storage layer's CPU consumption. Finally, we evaluate RS-store on an RDMA-capable cluster. Experimental results show that RS-store achieves up to 2x improvements over RDMA-enabled RocksDB in throughput and application scalability.
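To illustrate why a skiplist supports both point operations and range queries, here is a minimal single-process sketch. It is not the R-skiplist from the paper and involves no RDMA or concurrency control; the class and method names (`SkipList`, `put`, `get`, `range`) are hypothetical and chosen only for illustration.

```python
import random


class Node:
    def __init__(self, key, value, height):
        self.key, self.value = key, value
        self.next = [None] * height  # next[i] is the successor on level i


class SkipList:
    MAX_HEIGHT = 16  # assumed cap on tower height

    def __init__(self, seed=0):
        self.head = Node(None, None, self.MAX_HEIGHT)  # sentinel, holds no key
        self.rng = random.Random(seed)

    def _random_height(self):
        # Each extra level is kept with probability 1/2.
        h = 1
        while h < self.MAX_HEIGHT and self.rng.random() < 0.5:
            h += 1
        return h

    def _predecessors(self, key):
        """For each level, the last node whose key is strictly less than key."""
        preds = [self.head] * self.MAX_HEIGHT
        node = self.head
        for level in reversed(range(self.MAX_HEIGHT)):
            while node.next[level] is not None and node.next[level].key < key:
                node = node.next[level]
            preds[level] = node
        return preds

    def put(self, key, value):
        preds = self._predecessors(key)
        hit = preds[0].next[0]
        if hit is not None and hit.key == key:
            hit.value = value  # overwrite an existing key in place
            return
        node = Node(key, value, self._random_height())
        for level in range(len(node.next)):
            node.next[level] = preds[level].next[level]
            preds[level].next[level] = node

    def get(self, key):
        """Point lookup: descend the levels, then check the bottom-level successor."""
        hit = self._predecessors(key)[0].next[0]
        return hit.value if hit is not None and hit.key == key else None

    def range(self, lo, hi):
        """Range query: yield (key, value) pairs with lo <= key < hi in key order."""
        node = self._predecessors(lo)[0].next[0]
        while node is not None and node.key < hi:
            yield node.key, node.value
            node = node.next[0]
```

The range query reuses the point-lookup descent to find the first key and then walks the sorted bottom level, which is the ordering a hash table does not provide.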
Abstract:
We introduce an interleaving operational semantics for describing the client-observable behaviour of atomic transactions on distributed key-value stores. Our semantics builds on abstract states comprising centralised, global key-value stores and partial client views. Using our abstract states, we present operational definitions of well-known consistency models in the literature, and prove them to be equivalent to their existing declarative definitions using abstract executions. We explore two applications of our operational framework: 1) verifying that the COPS replicated database and the Clock-SI partitioned database satisfy their consistency models using trace refinement, and 2) proving invariant properties of client programs.
Abstract:
This paper describes the design and implementation of HERD, a key-value system designed to make the best use of an RDMA network. Unlike prior RDMA-based key-value systems, HERD focuses its design on reducing network round trips while using efficient RDMA primitives; the result is substantially lower latency, and throughput that saturates modern, commodity RDMA hardware. HERD makes two unconventional design decisions. First, it does not use RDMA reads, despite the allure of operations that bypass the remote CPU entirely. Second, it uses a mix of RDMA and messaging verbs, despite the conventional wisdom that the messaging primitives are slow. A HERD client writes its request into the server's memory; the server computes the reply. This design uses a single round trip for all requests and supports up to 26 million key-value operations per second with 5 µs average latency. Notably, for small key-value items, our full system throughput is similar to native RDMA read throughput and is over 2x higher than that of recent RDMA-based key-value systems. We believe that HERD further serves as an effective template for the construction of RDMA-based datacenter services.
Abstract:
Although distributed key-value stores are becoming increasingly popular as a complement to conventional distributed file systems, they are often criticized for their costly full-size replication for high availability, which causes high I/O overhead. This paper presents two techniques to mitigate this I/O overhead and improve key-value store performance: GPU encoding and locality-aware encoding. Instead of migrating full-size replicas over the network, we split the original file into smaller chunks and encode them with a few additional parity codes using GPUs before dispersing them onto remote nodes. The parity code is usually much smaller than the original file, which saves the extra space required for high availability and reduces the I/O overhead. Meanwhile, the compute-intensive encoding process is largely accelerated by the massive number of GPU cores. However, splitting the original file into smaller chunks stored on multiple nodes breaks data locality from the application's perspective. To address this, we present a locality-aware encoding mechanism that allows a job to be dispatched as finer-grained tasks directly on the node where the required chunk resides. Data locality is therefore preserved at the finer granularity of the sub-job (i.e., task) level. We conduct an in-depth analysis of the proposed approach and implement a system prototype named Gest. Gest has been deployed and evaluated on a variety of testbeds, demonstrating that high data availability, high space efficiency, and high I/O performance can be achieved at the same time.
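The idea of splitting a file into chunks and adding parity instead of keeping full replicas can be sketched as follows. This is only an assumption-laden illustration: the abstract does not say which code Gest uses, so the sketch swaps in the simplest possible scheme (a single XOR parity chunk, which tolerates one lost chunk) and runs on the CPU rather than a GPU. The function names `split_chunks`, `xor_parity`, and `recover` are hypothetical.

```python
from functools import reduce
from typing import Optional


def split_chunks(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal-size chunks, zero-padding the tail."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    return [padded[i * size:(i + 1) * size] for i in range(k)]


def xor_parity(chunks: list[bytes]) -> bytes:
    """Byte-wise XOR of equal-length chunks (the single parity chunk)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def recover(chunks: list[Optional[bytes]], parity: bytes) -> list[bytes]:
    """Rebuild at most one missing chunk (marked None) from the parity."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    if len(missing) > 1:
        raise ValueError("a single parity chunk tolerates only one lost chunk")
    repaired = list(chunks)
    if missing:
        survivors = [c for c in chunks if c is not None]
        # XOR of all surviving chunks and the parity equals the lost chunk.
        repaired[missing[0]] = xor_parity(survivors + [parity])
    return repaired
```

With k data chunks, the parity chunk is 1/k the size of the padded file, whereas one full replica doubles the storage; real deployments typically use codes with several parity chunks to survive more than one failure.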